Make IFBench eval tolerate missing or whitespace-shifted responses #27
resolvicomai wants to merge 1 commit into
/claim https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4

Scope note: this PR targets the evaluator-reliability path for the IF-RLVR/Bench bounty. It is independent from trainer/reward-helper integrations such as #28: it makes the existing eval path runnable on bundled sample outputs, normalizes prompt whitespace, and treats missing generations as failed rows instead of aborting the run.

Friendly bounty-review ping: this PR targets the evaluator-reliability path for the IF-RLVR/Bench bounty. It keeps the scope small by making the existing eval path complete on bundled sample outputs, normalizing prompt whitespace, and scoring missing generations as failed rows instead of aborting. Happy to adjust if the sponsor wants a different slice for the bounty.
Summary
KeyError
Why
The current sample evaluation can fail before scoring because data/sample_output.jsonl has prompts with trailing whitespace and fewer rows than data/IFBench_test.jsonl. For eval and RLVR-style reward loops, a missing generation should produce a failed score, not abort the whole run.

Context: Prime Intellect IF-RLVR/Bench Algora bounty: https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4
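
For reviewers, the behavior described under Why can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the helper names (normalize_prompt, build_response_index, score_rows), the "prompt" and "response" field names, and the use of str.strip() as the normalization are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List


def normalize_prompt(prompt: str) -> str:
    """Hypothetical normalization: drop leading/trailing whitespace."""
    return prompt.strip()


def build_response_index(rows: Iterable[dict]) -> Dict[str, str]:
    """Map each normalized prompt to its response (field names are assumed)."""
    return {normalize_prompt(r["prompt"]): r["response"] for r in rows}


def score_rows(
    inputs: Iterable[dict],
    responses: Iterable[dict],
    check: Callable[[dict, str], bool],
) -> List[dict]:
    """Score every input row; a missing generation becomes a failed row."""
    index = build_response_index(responses)
    results = []
    for row in inputs:
        # .get() returns None instead of raising KeyError on a missing prompt.
        response = index.get(normalize_prompt(row["prompt"]))
        if response is None:
            results.append({"prompt": row["prompt"], "passed": False, "missing": True})
            continue
        results.append(
            {"prompt": row["prompt"], "passed": bool(check(row, response)), "missing": False}
        )
    return results
```

With this shape, a response file that has fewer rows than the input file still yields one result per input row, and a response prompt with trailing whitespace still matches its input.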
Validation
uv run pytest -q
rm -rf /tmp/ifbench-eval && uv run python -m run_eval --input_data=data/IFBench_test.jsonl --input_response_data=data/sample_output.jsonl --output_dir=/tmp/ifbench-eval

/claim https://algora.io/PrimeIntellect-ai/bounties/dderbjHtPwTiGVY4